Non-Personalized Recommenders Assignment

Overview

This assignment will explore non-personalized recommendations. You will be given a 20x20 matrix where columns represent movies, rows represent users, and each cell represents a user-movie rating.

Deliverables

There are 4 deliverables for this assignment. Each deliverable represents a different analysis of the data provided to you. For each deliverable, you will submit a list of the top 5 movies as ranked by a particular metric. The 4 metrics are:

Mean Rating: Calculate the mean rating for each movie, order with the highest rating listed first, and submit the top 5.
% of ratings 4+: Calculate the percentage of ratings for each movie that are 4 or higher. Order with the highest percentage first, and submit the top 5.
Rating Count: Count the number of ratings for each movie, order with the largest count first, and submit the top 5.
Top 5 Star Wars: Find the movies that most often co-occur with Star Wars: Episode IV - A New Hope (1977) using the (x+y)/x method described in class. In other words, for each movie, calculate the percentage of Star Wars raters who also rated that movie. Order with the highest percentage first, and submit the top 5.
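
To make the four metrics concrete before any real data is loaded, here is a minimal pure-Python sketch on a made-up table of three movies and four users. The movie titles, the ratings, and every function name below are illustrative assumptions and are not part of the assignment data; None stands in for a missing rating.

```python
# Toy ratings: one list per movie, one slot per user; None marks "not rated".
# Titles and values are invented for illustration only.
ratings = {
    "Movie A": [5, 4, None, 3],
    "Movie B": [None, 2, 4, None],
    "Movie C": [4, None, 5, 5],
}

def rated(values):
    """Return only the ratings that are present."""
    return [v for v in values if v is not None]

def mean_rating(values):
    present = rated(values)
    return sum(present) / float(len(present))

def share_four_plus(values):
    """Fraction of the present ratings that are 4 or higher."""
    present = rated(values)
    return sum(1 for v in present if v >= 4) / float(len(present))

def rating_count(values):
    return len(rated(values))

def association(x_values, y_values):
    """Fraction of the users who rated x that also rated y."""
    x_raters = sum(1 for x in x_values if x is not None)
    both = sum(1 for x, y in zip(x_values, y_values)
               if x is not None and y is not None)
    return both / float(x_raters)

def top_n(scores, n=5):
    """Sort (movie, score) pairs from highest to lowest score and keep n."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

means = {m: mean_rating(v) for m, v in ratings.items()}
shares = {m: share_four_plus(v) for m, v in ratings.items()}
counts = {m: rating_count(v) for m, v in ratings.items()}
assoc_with_a = {m: association(ratings["Movie A"], v)
                for m, v in ratings.items() if m != "Movie A"}
```

Each deliverable is then top_n applied to the matching dictionary, for example top_n(means) for the Mean Rating list.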

Importing Libraries



In [114]:

    
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

Loading the Data



In [115]:

    
# Loading the data into a Pandas dataframe
movie_data = pd.read_csv('A1Ratings.csv')
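
As an aside on the load step above, one possible variation is to pass index_col to pd.read_csv so that the user ids become the row index instead of a data column. The sketch below uses an invented two-movie CSV held in memory in place of A1Ratings.csv, so the titles, ids, and ratings are assumptions; the rest of this notebook does not use this variant.

```python
import io
import pandas as pd

# A two-movie stand-in for A1Ratings.csv; the titles and values are invented.
csv_text = u"""User,10: Movie A (2000),20: Movie B (2001)
101,5,
102,,3
103,4,2
"""

# index_col moves the user ids out of the data columns, so every
# remaining column is a movie and no [1:] slice is needed later.
toy = pd.read_csv(io.StringIO(csv_text), index_col="User")

toy_means = toy.mean()    # per-movie mean, skipping missing ratings
toy_counts = toy.count()  # per-movie number of non-missing ratings
```

With the ids in the index, column-wise calls such as toy.mean() cover only movie columns, which avoids accidentally averaging the user ids.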



In [116]:

    
# Looking at the first 5 rows of the dataframe
movie_data.head()









    Out[116]:






  
    
      
	User	260: Star Wars: Episode IV - A New Hope (1977)	1210: Star Wars: Episode VI - Return of the Jedi (1983)	356: Forrest Gump (1994)	318: Shawshank Redemption, The (1994)	593: Silence of the Lambs, The (1991)	3578: Gladiator (2000)	1: Toy Story (1995)	2028: Saving Private Ryan (1998)	296: Pulp Fiction (1994)	...	2396: Shakespeare in Love (1998)	2916: Total Recall (1990)	780: Independence Day (ID4) (1996)	541: Blade Runner (1982)	1265: Groundhog Day (1993)	2571: Matrix, The (1999)	527: Schindler's List (1993)	2762: Sixth Sense, The (1999)	1198: Raiders of the Lost Ark (1981)	34: Babe (1995)
0	755	1	5	2	NaN	4	4	2	2	NaN	...	2	NaN	5	2	NaN	4	2	5	NaN	NaN
1	5277	5	3	NaN	2	4	2	1	NaN	NaN	...	3	2	2	NaN	2	NaN	5	1	3	NaN
2	1577	NaN	NaN	NaN	5	2	NaN	4	NaN	NaN	...	NaN	1	4	4	1	1	2	3	1	3
3	4388	NaN	3	NaN	NaN	NaN	1	2	3	4	...	NaN	4	1	3	5	NaN	5	1	1	2
4	1202	4	3	4	1	4	1	NaN	4	NaN	...	5	1	NaN	4	NaN	3	5	5	NaN	NaN

5 rows × 21 columns



In [117]:

    
# Printing the column names of the dataframe
movie_data.columns









    Out[117]:





Index([u'User', u'260: Star Wars: Episode IV - A New Hope (1977)',
       u'1210: Star Wars: Episode VI - Return of the Jedi (1983)',
       u'356: Forrest Gump (1994)', u'318: Shawshank Redemption, The (1994)',
       u'593: Silence of the Lambs, The (1991)', u'3578: Gladiator (2000)',
       u'1: Toy Story (1995)', u'2028: Saving Private Ryan (1998)',
       u'296: Pulp Fiction (1994)', u'1259: Stand by Me (1986)',
       u'2396: Shakespeare in Love (1998)', u'2916: Total Recall (1990)',
       u'780: Independence Day (ID4) (1996)', u'541: Blade Runner (1982)',
       u'1265: Groundhog Day (1993)', u'2571: Matrix, The (1999)',
       u'527: Schindler's List (1993)', u'2762: Sixth Sense, The (1999)',
       u'1198: Raiders of the Lost Ark (1981)', u'34: Babe (1995)'],
      dtype='object')



In [118]:

    
# Summarizing the data in the movie_data dataframe
movie_data.describe()









    Out[118]:






  
    
      
	User	260: Star Wars: Episode IV - A New Hope (1977)	1210: Star Wars: Episode VI - Return of the Jedi (1983)	356: Forrest Gump (1994)	318: Shawshank Redemption, The (1994)	593: Silence of the Lambs, The (1991)	3578: Gladiator (2000)	1: Toy Story (1995)	2028: Saving Private Ryan (1998)	296: Pulp Fiction (1994)	...	2396: Shakespeare in Love (1998)	2916: Total Recall (1990)	780: Independence Day (ID4) (1996)	541: Blade Runner (1982)	1265: Groundhog Day (1993)	2571: Matrix, The (1999)	527: Schindler's List (1993)	2762: Sixth Sense, The (1999)	1198: Raiders of the Lost Ark (1981)	34: Babe (1995)
count	20.000000	15.000000	14.000000	10.000000	10.000000	16.00000	12.000000	17.000000	11.000000	11.000000	...	11.000000	12.000000	13.000000	9.000000	12.000000	12.000000	12.000000	12.000000	11.000000	10.000000
mean	3658.100000	3.266667	3.000000	2.700000	3.600000	3.06250	2.916667	2.823529	3.000000	3.000000	...	2.909091	1.916667	2.769231	3.222222	3.166667	2.833333	3.000000	2.833333	2.909091	3.000000
std	1749.716756	1.387015	1.467599	1.337494	1.646545	1.28938	1.564279	1.131111	1.414214	1.183216	...	1.513575	0.996205	1.235168	1.092906	1.585923	1.527525	1.595448	1.642245	1.578261	1.414214
min	139.000000	1.000000	1.000000	1.000000	1.000000	1.00000	1.000000	1.000000	1.000000	1.000000	...	1.000000	1.000000	1.000000	2.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
25%	2558.750000	2.000000	2.000000	2.000000	2.500000	2.00000	1.750000	2.000000	2.000000	2.000000	...	2.000000	1.000000	2.000000	2.000000	2.000000	1.750000	2.000000	1.000000	1.500000	2.000000
50%	4252.500000	4.000000	3.000000	2.500000	4.000000	3.00000	3.000000	2.000000	3.000000	3.000000	...	3.000000	2.000000	3.000000	3.000000	3.000000	2.500000	2.500000	3.000000	3.000000	2.500000
75%	4916.250000	4.000000	4.000000	3.750000	5.000000	4.00000	4.000000	4.000000	4.000000	4.000000	...	4.000000	2.250000	4.000000	4.000000	5.000000	4.000000	5.000000	4.250000	4.000000	4.000000
max	6037.000000	5.000000	5.000000	5.000000	5.000000	5.00000	5.000000	5.000000	5.000000	5.000000	...	5.000000	4.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000

8 rows × 21 columns

Non-Personalized Recommenders for Raiders of the Lost Ark



In [119]:

    
# Storing the "1198: Raiders of the Lost Ark (1981)" column as a pandas Series
raid_lost_arc = movie_data["1198: Raiders of the Lost Ark (1981)"]
raid_lost_arc









    Out[119]:





0    NaN
1      3
2      1
3      1
4    NaN
5    NaN
6      5
7      5
8    NaN
9    NaN
10     1
11   NaN
12     5
13   NaN
14     3
15     3
16   NaN
17     2
18   NaN
19     3
Name: 1198: Raiders of the Lost Ark (1981), dtype: float64

Mean rating for Raiders of the Lost Ark (1981)



In [120]:

    
print('%.2f' % raid_lost_arc.mean())

Number of non-NA ratings for Raiders of the Lost Ark (1981)



In [121]:

    
raid_lost_arc.count()









    Out[121]:





11

Percentage of ratings >=4 for Raiders of the Lost Ark (1981)



In [122]:

    
# (raid_lost_arc >= 4) is False for missing ratings, so sum() counts only real ratings of 4 or 5
print('%.1f' % (100.0 * (raid_lost_arc >= 4).sum() / float(raid_lost_arc.count())))

Finding Association of Raiders of the Lost Ark (1981) with Star Wars Episode IV. The association with Star Wars Episode IV is defined as the number of users that rated BOTH Raiders of the Lost Ark (1981) and Star Wars Episode IV divided by the number of users that rated Star Wars Episode IV.



In [123]:

    
# First, storing the Star Wars count
star_wars_count = movie_data["260: Star Wars: Episode IV - A New Hope (1977)"].count()



In [124]:

    
# Multiply the Raiders of the Lost Ark and Star Wars ratings element-wise.
# The product is NaN wherever either rating is missing, so count() returns
# the number of users who rated both movies.
rad_arc_star_wars_count = (movie_data["1198: Raiders of the Lost Ark (1981)"]*movie_data["260: Star Wars: Episode IV - A New Hope (1977)"]).count()

Printing the Association of Raiders of the Lost Ark (1981) and Star Wars Episode IV



In [125]:

    
print('%.1f' % (100.0 * rad_arc_star_wars_count / float(star_wars_count)))
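
The multiplication trick used above relies on NaN propagating through arithmetic. A more explicit way to count users who rated both movies is to combine boolean "has a rating" masks. The sketch below compares the two on invented ratings for five users; x and y are hypothetical stand-ins for the Star Wars and Raiders of the Lost Ark columns.

```python
import numpy as np
import pandas as pd

# Invented ratings for five users; NaN marks "not rated".
x = pd.Series([5, np.nan, 3, 4, np.nan])   # stands in for Star Wars
y = pd.Series([np.nan, 2, 4, 1, np.nan])   # stands in for the other movie

# Product trick: NaN * anything is NaN, so count() sees only users with both.
both_by_product = (x * y).count()

# Explicit version: a user counts when both ratings are present.
both_by_mask = (x.notnull() & y.notnull()).sum()

# Share of x's raters who also rated y.
association = both_by_mask / float(x.count())
```

Both counts agree here; the mask version states the intent directly and does not depend on the ratings being numeric.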

Finding top 5 movies with the highest ratings

Making a Pandas Series whose index is the movie name and whose values are the mean rating for each movie. The column names of the movie_data dataframe are sliced with [1:] because the first column is the user id.



In [126]:

    
rating_means = pd.Series([movie_data[col_name].mean() for col_name in movie_data.columns[1:]], 
                         index=movie_data.columns[1:])

Printing the top 5 rated movies



In [127]:

    
rating_means.sort_values(ascending=False)[0:5]









    Out[127]:





318: Shawshank Redemption, The (1994)             3.600000
260: Star Wars: Episode IV - A New Hope (1977)    3.266667
541: Blade Runner (1982)                          3.222222
1265: Groundhog Day (1993)                        3.166667
593: Silence of the Lambs, The (1991)             3.062500
dtype: float64
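
The per-column loop above could also be written with DataFrame-level methods, since DataFrame.mean() works column by column and skips NaN by default. This is a sketch on an invented four-user table; toy and its column names are assumptions standing in for movie_data.

```python
import numpy as np
import pandas as pd

# Invented stand-in for movie_data: a User column followed by movie columns.
toy = pd.DataFrame({
    "User": [1, 2, 3, 4],
    "Movie A": [5, 4, np.nan, 3],
    "Movie B": [np.nan, 2, 4, np.nan],
    "Movie C": [4, np.nan, 5, 5],
})

ratings_only = toy.drop("User", axis=1)   # keep only the movie columns
toy_means = ratings_only.mean()           # column-wise mean; NaN is skipped
top_two = toy_means.sort_values(ascending=False).head(2)
```

Dropping the User column by name plays the same role as the [1:] slice, without depending on the column order.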

Finding top 5 movies with the most ratings

Making a Pandas Series whose index is the movie name and whose values are the number of non-NA ratings for each movie. The column names of the movie_data dataframe are sliced with [1:] because the first column is the user id.



In [128]:

    
rating_count = pd.Series([movie_data[col_name].count() for col_name in movie_data.columns[1:]], 
                         index=movie_data.columns[1:])

Printing the top 5 movies with the most ratings



In [129]:

    
rating_count.sort_values(ascending=False)[0:5]









    Out[129]:





1: Toy Story (1995)                                        17
593: Silence of the Lambs, The (1991)                      16
260: Star Wars: Episode IV - A New Hope (1977)             15
1210: Star Wars: Episode VI - Return of the Jedi (1983)    14
780: Independence Day (ID4) (1996)                         13
dtype: int64

Top 5 movies by percentage of ratings >= 4



In [130]:

    
rating_positive = pd.Series([sum(movie_data[col_name]>=4)/float(movie_data[col_name].count()) for col_name in movie_data.columns[1:]], 
                             index=movie_data.columns[1:])

Printing the top 5 movies by share of ratings >= 4 (values are fractions; multiply by 100 for percentages)



In [131]:

    
rating_positive.sort_values(ascending=False)[0:5]









    Out[131]:





318: Shawshank Redemption, The (1994)             0.700000
260: Star Wars: Episode IV - A New Hope (1977)    0.533333
3578: Gladiator (2000)                            0.500000
541: Blade Runner (1982)                          0.444444
593: Silence of the Lambs, The (1991)             0.437500
dtype: float64
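
The share of ratings that are 4 or higher can also be computed for all movies at once. The sketch below assumes an invented table that holds only movie columns; it relies on the fact that comparing NaN with >= yields False, so missing ratings never count as positive, while count() supplies the number of real ratings as the denominator.

```python
import numpy as np
import pandas as pd

# Invented ratings; one column per movie, NaN marks "not rated".
ratings_only = pd.DataFrame({
    "Movie A": [5, 4, np.nan, 3],
    "Movie B": [np.nan, 2, 4, np.nan],
    "Movie C": [4, np.nan, 5, 5],
})

at_least_four = (ratings_only >= 4).sum()          # NaN >= 4 is False
share = at_least_four / ratings_only.count().astype(float)
percent = (100 * share).round(1)                   # fractions as percentages
```

Dividing by count() rather than by the number of rows matters: using the row count would penalize movies that simply have fewer ratings.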

Top 5 movies most similar to Star Wars (movie id = 260)



In [132]:

    
# First, storing the Star Wars ratings and the count of non-NA Star Wars ratings
star_wars_rat = movie_data["260: Star Wars: Episode IV - A New Hope (1977)"]
star_wars_count = float(movie_data["260: Star Wars: Episode IV - A New Hope (1977)"].count())
print(star_wars_count)

Finding the association of every other movie with Star Wars Episode IV. The association is defined as the number of users who rated BOTH movie i and Star Wars Episode IV, divided by the number of users who rated Star Wars Episode IV. The loop below runs over columns[2:], which skips the User column (index 0) and Star Wars Episode IV itself (index 1).



In [133]:

    
sim_val = pd.Series( [ (movie_data[col_name]*star_wars_rat).count()/star_wars_count 
                     for col_name in movie_data.columns[2:] ], index=movie_data.columns[2:] )

Printing the top 5 movies most similar to Star Wars (movie id = 260)



In [134]:

    
sim_val.sort_values(ascending=False)[0:5]









    Out[134]:





1: Toy Story (1995)                                        0.933333
1210: Star Wars: Episode VI - Return of the Jedi (1983)    0.866667
593: Silence of the Lambs, The (1991)                      0.800000
780: Independence Day (ID4) (1996)                         0.733333
2916: Total Recall (1990)                                  0.666667
dtype: float64
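
If associations are needed for more than one reference movie, a possible generalization is to build the full co-rating count matrix in one step from a 0/1 "was rated" table. The sketch below uses invented data; rated, co_counts, and target are hypothetical names, and "Movie A" plays the role of Star Wars.

```python
import numpy as np
import pandas as pd

# Invented ratings; one column per movie, NaN marks "not rated".
ratings_only = pd.DataFrame({
    "Movie A": [5, 4, np.nan, 3],
    "Movie B": [np.nan, 2, 4, np.nan],
    "Movie C": [4, np.nan, 5, 5],
})

rated = ratings_only.notnull().astype(int)   # 1 where a user rated the movie
co_counts = rated.T.dot(rated)               # [a, b] = users who rated both a and b

target = "Movie A"
# The diagonal entry is the number of users who rated the target itself.
assoc = co_counts[target] / float(co_counts.loc[target, target])
assoc = assoc.drop(target).sort_values(ascending=False)
```

Each column of co_counts divided by its own diagonal entry gives the same association values as the loop above for that reference movie.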


